⭐ 📅 📃 👉 📖 🤦♂️ 🖖 🤓
Star date 71676.31. Our mission is to use R statistical software to extract star dates mentioned in the captain’s log from the scripts of Star Trek: The Next Generation and observe their progression over the course of the show’s seven seasons. There appears to be some mismatch in the frequency of digits after the decimal point – could this indicate poor abillity to choose random numbers? Or something more sinister? We shall venture deep into uncharted territory for answers…
Download all the scripts from Star Trek: The Next Generation from the Star Trek Minutiae website. This is provided as a zipped folder containing 276 text files numbered consecutively from 102 to 277.
Ready the workspace by loading the packages we’ll need for data manipulation.
library(readr) # read files
library(purrr) # iterate functions over files
library(stringr) # manipulate strings
library(dplyr) # data manipulation and pipe opeartor (%>%)We’re going to extract the content of the the text files using the read_lines() function from the readr package. We’ll iterate over each file with the map() function from the purrr package to read them into a list object where each element is a script.
scripts <- purrr::map(
list.files( # create vector of filepath strings to each file
"data/scripts", # file location of the text files
full.names = TRUE # e.g. "data/scripts/102.txt"
),
readr::read_lines # read the content from each filepath
)We can take a look at some example lines ([17:34]) from the title page of the first script (element [[1]]).
scripts[[1]][17:34]## [1] " STAR TREK: THE NEXT GENERATION "
## [2] " "
## [3] " \"Encounter at Farpoint\" "
## [4] " "
## [5] " by "
## [6] " D.C. Fontana "
## [7] " and "
## [8] " Gene Roddenberry "
## [9] ""
## [10] ""
## [11] "This script is not for publicaion or reproduction."
## [12] "No one is authorized to dispose of the same. If lost or"
## [13] "destroyed, please notify the Script Department."
## [14] ""
## [15] ""
## [16] " FINAL DRAFT"
## [17] ""
## [18] " April 13, 1987"
Our first example of a star date is in the Captain’s log voiceover in lines 46 to 50 of the first script.
scripts[[1]][46:50]## [1] "\t\t\t\t\tPICARD V.O."
## [2] "\t\t\tCaptain's log, stardate 42353.7."
## [3] "\t\t\tOur destination is planet Cygnus"
## [4] "\t\t\tIV, beyond which lies the great"
## [5] "\t\t\tunexplored mass of the galaxy."
We want to extract stardate strings from each script. These are in the form XXXXX.X, where ‘X’ is a digit. We want to capture the stardates mentioned as part of the captain’s log voiceover where possible, so we can start our search pattern with ‘date’. This will help us avoid matching to strings that have a stardate-like pattern but aren’t stardates
We can extract these with str_extract_all() from the stringr package, using a regex (regular expression). Our regex is written date[:space:][[:digit:]\\.[:digit:]]{7}. This means ‘find a string that starts with the word date followed by a space (date), which is followed by a string that contains digits ([:digit:]) with a period (\\.) inside, with a total length of seven characters ({7})’.
This will provide a list object where each element contains the regex-matched string for a script.
stardate_extract <- stringr::str_extract_all(
scripts, # location from which to extract
pattern = "date[:space:][[:digit:]\\.[:digit:]]{7}" # regex
)
head(stardate_extract) # see the first few list elements## [[1]]
## [1] "date 42353.7" "date 42354.1" "date 42354.2" "date 42354.7"
## [5] "date 42372.5"
##
## [[2]]
## [1] "date 41209.2" "date 41209.3"
##
## [[3]]
## [1] "date 41235.2" "date 41235.3"
##
## [[4]]
## [1] "date 41294.5" "date 41294.7"
##
## [[5]]
## [1] "date 41263.1" "date 41263.2" "date 41263.3" "date 41263.4"
##
## [[6]]
## [1] "date 41194.6" "date 41194.8"
We’re now going to tody the data to:
tibble::enframe()) with one row per episodetidyr::unnest())dplyr::transmute()) and remove the instances of the string ‘date’ (stringr::str_replace())dplyr::mutate(case_when()))dplyr::if_else())dplyr::filter())stardate_tidy <- stardate_extract %>%
tibble::enframe() %>%
tidyr::unnest() %>%
dplyr::transmute(
episode = name,
stardate = stringr::str_replace(
string = value,
pattern = "date ",
replacement = ""
)
) %>%
dplyr::mutate(
season = as.character(
case_when(
episode %in% 1:25 ~ "1",
episode %in% 26:47 ~ "2",
episode %in% 48:73 ~ "3",
episode %in% 74:99 ~ "4",
episode %in% 100:125 ~ "5",
episode %in% 126:151 ~ "6",
episode %in% 152:176 ~ "7"
)
),
stardate = as.numeric(
dplyr::if_else(
condition = stardate %in% c("41148..", "40052..", "37650.."),
true = "NA",
false = stardate
)
)
) %>%
dplyr::filter(!is.na(stardate))
stardate_tidy## # A tibble: 263 x 3
## episode stardate season
## <int> <dbl> <chr>
## 1 1 42354. 1
## 2 1 42354. 1
## 3 1 42354. 1
## 4 1 42355. 1
## 5 1 42372. 1
## 6 2 41209. 1
## 7 2 41209. 1
## 8 3 41235. 1
## 9 3 41235. 1
## 10 4 41294. 1
## # ... with 253 more rows
Let’s visualise the stardates by episode.
We can make this interactive using plot
We can use the theme
library(ggplot2)
library(plotly)
library(ggsci)
library(ggthemes)
stardate_dotplot <- stardate_tidy %>%
ggplot2::ggplot() +
geom_point(aes(x = episode, y = stardate, color = season)) +
labs(title = "Stardates are almost but not quite chronological") +
theme_solarized_2(light = FALSE) +
scale_color_startrek()
plotly::ggplotly(
stardate_dotplot,
tooltip = c("stardate", "episode", "season")
)Extract them.
stardate_tidy_decimal <- stardate_tidy %>%
mutate(
stardate_decimal = as.numeric(
str_sub(
as.character(stardate),
7,
7
)
),
stardate_decimal = ifelse(
is.na(stardate_decimal),
0,
stardate_decimal
)
) %>%
select(season, episode, stardate, stardate_decimal)Datatable of them.
library(DT)
stardate_tidy_decimal %>%
mutate(season = as.factor(season)) %>%
DT::datatable(
filter = "top",
extensions = 'Buttons',
options = list(
autoWidth = TRUE, # column width consistent when making selections
dom = "Blfrtip",
buttons =
list("copy", list(
extend = "collection",
buttons = c("csv", "excel", "pdf"),
text = "Download"
)
),
# customize the length menu
lengthMenu = list(
c(10, 25, 50, -1), # declare values
c(10, 25, 50, "All") # declare titles
), # end of lengthMenu customization
pageLength = 10
)
)Do a barplot.
stardate_tidy_decimal %>%
ggplot2::ggplot() +
geom_bar(aes(as.character(stardate_decimal)), fill = "#CC0C00FF") +
labs(
title = "Decimals one to three are most frequent and zero the least frequent",
x = "stardate decimal value"
) +
theme_dark() +
theme_solarized_2(light = FALSE)stardate_tidy_decimal %>%
ggplot2::ggplot() +
geom_bar(
aes(as.character(stardate_decimal)),
fill= c(
rep("#CC0C00FF", 10),
rep("#5C88DAFF", 9),
rep("#84BD00FF", 10),
rep("#FFCD00FF", 9),
rep("#7C878EFF", 10),
rep("#00B5E2FF", 8),
rep("#00AF66FF", 8)
)
) +
labs(
title = "There's a similar pattern of decimal stardate frequency across seasons",
x = "stardate decimal value"
) +
facet_wrap(~ season) +
theme_solarized_2(light = FALSE) +
scale_color_startrek()sessionInfo()## R version 3.4.3 (2017-11-30)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS High Sierra 10.13.3
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] DT_0.4.5 ggthemes_3.4.0 ggsci_2.8
## [4] plotly_4.7.1 ggplot2_2.2.1.9000 bindrcpp_0.2
## [7] dplyr_0.7.4 stringr_1.2.0 purrr_0.2.4
## [10] readr_1.1.1 emo_0.0.0.9000
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.15 pillar_1.2.1 compiler_3.4.3
## [4] plyr_1.8.4 bindr_0.1 tools_3.4.3
## [7] digest_0.6.15 viridisLite_0.3.0 jsonlite_1.5
## [10] lubridate_1.7.2 evaluate_0.10.1 tibble_1.4.2
## [13] gtable_0.2.0 pkgconfig_2.0.1 rlang_0.2.0
## [16] shiny_1.0.5 cli_1.0.0 crosstalk_1.0.1
## [19] yaml_2.1.18 httr_1.3.1 withr_2.1.1.9000
## [22] knitr_1.18 htmlwidgets_1.0 hms_0.3
## [25] rprojroot_1.2 grid_3.4.3 data.table_1.10.4-2
## [28] glue_1.2.0 R6_2.2.2 rmarkdown_1.6
## [31] tidyr_0.7.2 magrittr_1.5 backports_1.1.1
## [34] scales_0.5.0.9000 htmltools_0.3.6 assertthat_0.2.0
## [37] xtable_1.8-2 mime_0.5 colorspace_1.3-2
## [40] httpuv_1.3.5 labeling_0.3 utf8_1.1.3
## [43] stringi_1.1.6 lazyeval_0.2.1 munsell_0.4.3
## [46] crayon_1.3.4
The star date for today’s date (7 March 2018) as calculated using the trekguide.com method↩